Active Learning for Chinese Word Segmentation

Authors

  • Shoushan Li
  • Guodong Zhou
  • Chu-Ren Huang
Abstract

Currently, the best-performing models for Chinese word segmentation (CWS) require extremely large amounts of annotated data. One promising way to reduce the cost of data acquisition is active learning, which actively selects the most useful instances to annotate. Active learning for CWS, however, remains challenging due to the inherent nature of the task. In this paper, we propose a Word Boundary Annotation (WBA) model that makes effective active learning for CWS possible by annotating only the uncertain boundaries. This greatly reduces the manual annotation cost compared with annotating the whole character sequence. To further reduce the annotation effort, a diversity measurement among the instances is used to avoid duplicate annotation. Experimental results show that incorporating the WBA model and the diversity measurement into active learning for CWS saves much annotation cost with little loss in performance.
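
The selection idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's uncertainty or diversity measures, so everything below is an assumption made for illustration: uncertainty is taken as the closeness of a boundary probability to 0.5, diversity is approximated by skipping boundaries whose surrounding character context was already selected, and the names `select_boundaries`, `boundary_probs`, `budget`, and `context` are hypothetical.

```python
from typing import List


def select_boundaries(
    chars: str,
    boundary_probs: List[float],
    budget: int,
    context: int = 1,
) -> List[int]:
    """Pick up to `budget` boundary positions to send for annotation.

    boundary_probs[i] is a (hypothetical) model estimate of the probability
    that a word boundary lies between chars[i] and chars[i + 1].
    """
    if len(boundary_probs) != len(chars) - 1:
        raise ValueError("expected one probability per gap between characters")

    # Assumed uncertainty measure: the closer p is to 0.5, the less certain
    # the model is, so rank gaps by |p - 0.5| in ascending order.
    ranked = sorted(
        range(len(boundary_probs)),
        key=lambda i: abs(boundary_probs[i] - 0.5),
    )

    selected: List[int] = []
    seen_contexts = set()
    for i in ranked:
        if len(selected) >= budget:
            break
        # Assumed diversity filter: characters around the gap form its
        # context; a gap whose context was already selected is skipped.
        ctx = chars[max(0, i - context + 1): i + 1 + context]
        if ctx in seen_contexts:
            continue
        seen_contexts.add(ctx)
        selected.append(i)
    return selected
```

An annotator would then label only the returned gaps as boundary or non-boundary, rather than segmenting the whole character sequence. A real system would take `boundary_probs` from a trained segmenter and would likely use a richer similarity measure than exact context matching.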

Similar resources

A Simple and Effective Closed Test for Chinese Word Segmentation Based on Sequence Labeling

In many Chinese text processing tasks, Chinese word segmentation is a vital and required step. Previous studies have proposed various methods that address this problem using machine learning algorithms. To achieve high performance, many studies used external resources combined with various machine learning algorithms to help segmentation. The goal of this paper is to construct...

Incorporating Global Information into Supervised Learning for Chinese Word Segmentation

This paper presents a novel approach to Chinese word segmentation (CWS) that attempts to utilize global information (GI) such as co-occurrence of sub-sequences and outputs of unsupervised segmentation in the whole text for further enhancement of the state-of-the-art performance of conditional random fields (CRF) learning. In the existing work of CWS, supervised and unsupervised learning seldom ...

An Efficient Chinese Parsing Algorithm for Computer-Assisted Language Learning

Instructional grammar is often used in Computer-assisted Language Learning (CALL), and grammatical error detection is an important feature. However, it is not an easy task for the Chinese language. There is no delimiter separating consecutive words in Chinese sentences. Word segmentation is the process in which proper word boundaries are identified. Before syntactic parsing of a Chinese sentence, w...

Text Window Denoising Autoencoder: Building Deep Architecture for Chinese Word Segmentation

Deep learning is the new frontier of machine learning research, which has led to many recent breakthroughs in English natural language processing. However, there are inherent differences between Chinese and English, and little work has been done to apply deep learning techniques to Chinese natural language processing. In this paper, we propose a deep neural network model: text window denoising ...

Integrating Ngram Model and Case-based Learning for Chinese Word Segmentation

This paper presents our recent work for participation in the First International Chinese Word Segmentation Bakeoff (ICWSB-1). It is based on a general-purpose n-gram model for word segmentation and a case-based learning approach to disambiguation. This system excels in identifying in-vocabulary (IV) words, achieving a recall of around 96-98%. Here we present our strategies for language model trai...

Publication date: 2012